A reference haplotype panel for genome-wide imputation of short tandem repeats.
Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in complex traits. However, genotyping arrays used in genome-wide association studies focus on single nucleotide polymorphisms (SNPs) and do not readily allow identification of STR associations. We leverage next-generation sequencing (NGS) from 479 families to create a SNP + STR reference haplotype panel. Our panel enables imputing STR genotypes into SNP array data when NGS is not available for directly genotyping STRs. Imputed genotypes achieve mean concordance of 97% with observed genotypes in an external dataset, compared to 71% expected under a naive model. Performance varies widely across STRs, with near-perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic repeats. Imputation increases power over individual SNPs to detect STR associations with gene expression. Imputing STRs into existing SNP datasets will enable the first large-scale STR association studies across a range of complex traits.
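The concordance figures above compare imputed calls with directly observed genotypes. As a minimal sketch of one common way to score this, the following assumes each diploid call is a pair of repeat counts and that a call scores 0, 0.5, or 1 according to how many of its two alleles match. The abstract does not state the exact definition, and the function names and toy genotypes are illustrative only.

```python
from collections import Counter


def call_concordance(imputed, observed):
    """Fraction of the two observed alleles matched by the imputed call.

    Genotypes are unordered pairs of repeat counts, so (12, 14) and
    (14, 12) count as a perfect match.
    """
    remaining = Counter(imputed)
    matched = 0
    for allele in observed:
        if remaining[allele] > 0:
            remaining[allele] -= 1
            matched += 1
    return matched / 2


def mean_concordance(imputed_calls, observed_calls):
    """Average per-sample concordance across all samples at one STR."""
    scores = [call_concordance(i, o) for i, o in zip(imputed_calls, observed_calls)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): three samples at a single STR locus.
imputed = [(12, 12), (12, 14), (10, 15)]
observed = [(12, 12), (14, 12), (10, 13)]
concordance = mean_concordance(imputed, observed)
print(f"mean concordance: {concordance:.3f}")
```

Under this scoring, a highly polymorphic STR has more ways to miss an allele than a bi-allelic one, which is consistent with the lower concordance reported for such repeats.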
The impact of short tandem repeat variation on gene expression.
Short tandem repeats (STRs) have been implicated in a variety of complex traits in humans. However, genome-wide studies of the effects of STRs on gene expression thus far have had limited power to detect associations and provide insights into putative mechanisms. Here, we leverage whole-genome sequencing and expression data for 17 tissues from the Genotype-Tissue Expression Project to identify more than 28,000 STRs for which repeat number is associated with expression of nearby genes (eSTRs). We use fine-mapping to quantify the probability that each eSTR is causal and characterize the top 1,400 fine-mapped eSTRs. We identify hundreds of eSTRs linked with published genome-wide association study signals and implicate specific eSTRs in complex traits, including height, schizophrenia, inflammatory bowel disease and intelligence. Overall, our results support the hypothesis that eSTRs contribute to a range of human phenotypes, and our data should serve as a valuable resource for future studies of complex traits.
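An eSTR is defined above as an STR whose repeat number is associated with expression of a nearby gene. The sketch below illustrates the idea with a simple linear fit of expression on STR dosage, taken here as the mean repeat count of the two alleles. It is an assumption-laden toy rather than the study's pipeline: covariate adjustment, expression normalization, and significance testing are omitted, and all names and values are hypothetical.

```python
from statistics import mean


def str_dosage(genotype):
    """Mean repeat count of the two alleles (an assumed dosage encoding)."""
    a, b = genotype
    return (a + b) / 2


def linear_association(dosages, expression):
    """Least-squares slope and Pearson correlation of expression on dosage."""
    mx, my = mean(dosages), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5
    return slope, r


# Toy data (hypothetical): five individuals, one STR, one nearby gene.
genotypes = [(10, 10), (10, 12), (12, 12), (12, 14), (14, 14)]
expression = [1.0, 1.4, 2.1, 2.4, 3.1]

dosages = [str_dosage(g) for g in genotypes]
slope, r = linear_association(dosages, expression)
print(f"slope per repeat unit: {slope:.2f}, Pearson r: {r:.3f}")
```

A nonzero slope alone does not show that the STR is causal, since nearby SNPs in linkage disequilibrium can produce the same signal. That is the role of the fine-mapping step described in the abstract.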
Deep Characterization of the Contribution of Short Tandem Repeats Across Tissues
High-Throughput Sequencing (HTS) and Genome-Wide Association Studies (GWAS) have given us unprecedented insights into the influence of Single Nucleotide Variants (SNVs) and Copy Number Variants (CNVs) on different phenotypes, including gene expression, diseases, and complex traits. However, how other complex genetic variations such as Short Tandem Repeats (STRs) may affect gene expression remains largely unknown. Identifying and genotyping these types of variants from short DNA sequencing reads or low-coverage data presents difficult bioinformatics challenges. Additionally, traditional association tests must be modified to handle highly multi-allelic loci such as STRs. Several studies have examined the effect of STRs on gene expression genome-wide. However, these studies were restricted to a single cell type, such as whole blood or lymphoblastoid cell lines (LCLs), and had limited power to detect associations due to low-quality genotypes. Thus, the results of these studies have offered limited biological insight and are difficult to interpret in other contexts. In this dissertation, we address the importance of incorporating STRs in causal screening and large-scale medical genetics studies. We perform the first and largest characterization to date of STRs that contribute to gene expression variation across multiple tissues. To ensure robust and reliable results, we leverage data from the GTEx project, which has collected high-coverage whole-genome sequencing data and RNA-sequencing across dozens of tissues for more than 600 individuals. Our work confirms a clear contribution of STRs to gene expression regulation, with 25,554 eSTRs identified across 17 tissues. Of these, 14% are identified as high-confidence causal variants after fine-mapping against nearby SNPs. eSTRs are highly enriched at predicted promoter and enhancer regions and for motifs with high GC-content.
We identified a subset of eSTRs capable of forming G-quadruplexes (G4), a highly stable DNA secondary structure known to be involved in gene regulation. We show that long G4-forming STRs tend to increase expression of nearby genes, potentially by lowering the free energy of promoter regions and promoting RNA polymerase II stalling. Finally, we identify high-confidence eSTRs that likely underlie previously identified genetic associations with complex phenotypes, including schizophrenia and blood-related traits.
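The abstract does not say how G4-forming STRs were identified. As a rough illustration only, the sketch below uses the widely cited canonical G-quadruplex sequence motif (four runs of at least three guanines separated by loops of one to seven bases) as a regular expression. The function name and example sequences are hypothetical, and the pattern checks only the G-rich strand as given.

```python
import re

# Canonical G4 motif: G{3+} N{1-7} G{3+} N{1-7} G{3+} N{1-7} G{3+}.
# This is an assumed heuristic, not necessarily the dissertation's method.
G4_PATTERN = re.compile(r"(?:G{3,}[ACGT]{1,7}){3}G{3,}")


def has_g4_motif(sequence):
    """Return True if the sequence contains a canonical G4 motif."""
    return G4_PATTERN.search(sequence.upper()) is not None


# Toy repeat tracts (hypothetical).
gc_rich_repeat = "GGGGC" * 4   # four G-runs separated by single-base loops
cag_repeat = "CAG" * 10        # no run of three or more guanines

print(has_g4_motif(gc_rich_repeat))
print(has_g4_motif(cag_repeat))
```

Because each repeat unit contributes another G-run, longer tracts of a G-rich motif offer more ways to satisfy the pattern.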